Chad Pickering, 913328497
Corresponded with: Edie Espejo, Patrick Vacek, Graham Smith, Ricky Safran, Nivi Achanta, Sierra Tevlin, Hannah Kosinovsky, Janice Luong
Resources: A variety of package documentation. Lost to the void. It's 5:10am. I can't remember anymore.
Instructions: In this assignment, you'll scrape text from The California Aggie and then analyze the text.
The Aggie is organized by category into article lists. For example, there's a Campus News list, Arts & Culture list, and Sports list. Notice that each list has multiple pages, with a maximum of 15 articles per page.
The goal of exercises 1.1 - 1.3 is to scrape articles from the Aggie for analysis in exercise 1.4.
Exercise 1.1. Write a function that extracts all of the links to articles in an Aggie article list. The function should:
Have a parameter url for the URL of the article list.
Have a parameter page for the number of pages to fetch links from. The default should be 1.
Return a list of article URLs (each URL should be a string).
Test your function on 2-3 different categories to make sure it works.
Hints:
Be polite to The Aggie and save time by setting up requests_cache before you write your function.
Start by getting your function to work for just 1 page. Once that works, have your function call itself to get additional pages.
You can use lxml.html or BeautifulSoup to scrape HTML. Choose one and use it throughout the entire assignment.
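The recursion hint above can be sketched with a toy pager. Here `fetch_one` is a hypothetical stand-in for scraping a single list page (the name and the `/page/N/` URL pattern mirror the Aggie's pagination; this is an illustrative sketch, not the assignment's solution):

```python
def article_links(url, page=1, fetch_one=None):
    """Collect links from pages 1..page by recursing on page - 1.

    `fetch_one` stands in for the real work of scraping one list page;
    it takes a page URL and returns a list of article URLs.
    """
    # Scrape the links on this page
    links = fetch_one("%s/page/%d/" % (url, page))
    if page == 1:
        return links
    # Recurse to pick up the earlier pages, then append this page's links
    return article_links(url, page - 1, fetch_one) + links
```

For example, with a fake `fetch_one = lambda u: [u + "article"]`, `article_links("x", 2, fetch_one)` returns the page-1 link followed by the page-2 link.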
# Import packages:
import re
import random
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from collections import Counter
from urllib2 import Request, urlopen
from itertools import izip
from fastcache import clru_cache
import nltk
from nltk import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors, KNeighborsRegressor
from sklearn.metrics.pairwise import linear_kernel
from scipy.sparse import csr_matrix
from wordcloud import WordCloud, STOPWORDS
from matplotlib import pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
1.1 Answer: The following two functions extract links to articles in an Aggie article list: the first retrieves and parses a single list page, and the second collects article URLs from one or more list pages.
# Retrieves and parses one list page:
@clru_cache(maxsize=128, typed=False)
def one_page(url):
    url_input = urlopen(url)
    bs_parsed = BeautifulSoup(url_input, "html.parser")
    return bs_parsed

# Retrieves article urls from multiple list pages using one_page() (1 page as default):
@clru_cache(maxsize=128, typed=False)
def mult_pages(url, pages=1):
    # Call one_page() on each list page requested
    page_list = [one_page(url + "/page/" + str(i) + "/") for i in range(1, pages + 1)]
    # Each article title sits in an "h2" tag; the tag after it holds the article link
    url_list = [[text.findNext().get("href") for text in page.findAll("h2")] for page in page_list]
    # Flatten into a single list of urls, without differentiating per page
    urls = [url for page_urls in url_list for url in page_urls]
    return urls
# Test function on a category with >1 page; other tests ran and passed
# mult_pages("https://theaggie.org/sports", pages = 3)
# mult_pages("https://theaggie.org/campus", pages = 3)
mult_pages("https://theaggie.org/arts", pages = 2)
Exercise 1.2. Write a function that extracts the title, text, and author of an Aggie article. The function should:
Have a parameter url for the URL of the article.
For the author, extract the "Written By" line that appears at the end of most articles. You don't have to extract the author's name from this line.
Return a dictionary with keys "url", "title", "text", and "author". The values for these should be the article url, title, text, and author, respectively.
For example, for this article your function should return something similar to this:
{
'author': u'Written By: Bianca Antunez \xa0\u2014\xa0city@theaggie.org',
'text': u'Davis residents create financial model to make city\'s financial state more transparent To increase transparency between the city\'s financial situation and the community, three residents created a model called Project Toto which aims to improve how the city communicates its finances in an easily accessible design. Jeff Miller and Matt Williams, who are members of Davis\' Finance and Budget Commission, joined together with Davis entrepreneur Bob Fung to create the model plan to bring the project to the Finance and Budget Commission in February, according to Kelly Stachowicz, assistant city manager. "City staff appreciate the efforts that have gone into this, and the interest in trying to look at the city\'s potential financial position over the long term," Stachowicz said in an email interview. "We all have a shared goal to plan for a sound fiscal future with few surprises. We believe the Project Toto effort will mesh well with our other efforts as we build the budget for the next fiscal year and beyond." Project Toto complements the city\'s effort to amplify the transparency of city decisions to community members. The aim is to increase the understanding about the city\'s financial situation and make the information more accessible and easier to understand. The project is mostly a tool for public education, but can also make predictions about potential decisions regarding the city\'s financial future. Once completed, the program will allow residents to manipulate variables to see their eventual consequences, such as tax increases or extensions and proposed developments "This really isn\'t a budget, it is a forecast to see the intervention of these decisions," Williams said in an interview with The Davis Enterprise. "What happens if we extend the sales tax? What does it do given the other numbers that are in?" 
Project Toto enables users, whether it be a curious Davis resident, a concerned community member or a city leader, with the ability to project city finances with differing variables. The online program consists of the 400-page city budget for the 2016-2017 fiscal year, the previous budget, staff reports and consultant analyses. All of the documents are cited and accessible to the public within Project Toto. "It\'s a model that very easily lends itself to visual representation," Mayor Robb Davis said. "You can see the impacts of decisions the council makes on the fiscal health of the city." Complementary to this program, there is also a more advanced version of the model with more in-depth analyses of the city\'s finances. However, for an easy-to-understand, simplistic overview, Project Toto should be enough to help residents comprehend Davis finances. There is still more to do on the project, but its creators are hard at work trying to finalize it before the 2017-2018 fiscal year budget. "It\'s something I have been very much supportive of," Davis said. "Transparency is not just something that I have been supportive of but something we have stated as a city council objective [ ] this fits very well with our attempt to inform the public of our challenges with our fiscal situation." ',
'title': 'Project Toto aims to address questions regarding city finances',
'url': 'https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/'
}
Hints:
The author line is always the last line of the last paragraph.
Python 2 displays some Unicode characters as \uXXXX. For instance, \u201c is a left-facing quotation mark.
You can convert most of these to ASCII characters with the method call (on a string)
.translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0x2026:0x20 })
If you're curious about these characters, you can look them up on this page, or read more about what Unicode is.
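For example, the translation table above maps curly quotes to straight ASCII quotes and the ellipsis to a space (the sample string here is illustrative):

```python
# -*- coding: utf-8 -*-
# A string with a left/right double quote, a right single quote, and an ellipsis
smart = u"\u201cDavis\u2019s budget\u201d \u2026"

# Map curly quotes to ASCII quotes (0x27, 0x22) and the ellipsis to a space (0x20)
ascii_ish = smart.translate({0x2018: 0x27, 0x2019: 0x27,
                             0x201C: 0x22, 0x201D: 0x22, 0x2026: 0x20})
# ascii_ish == u'"Davis\'s budget"  '
```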
url_test = "https://theaggie.org/2017/02/15/suspect-in-davis-islamic-center-vandalism-arrested/"
@clru_cache(maxsize=128, typed=False)
def article_cont(url):
    # Retrieve and parse the article page
    url_input = urlopen(url)
    bs_parsed = BeautifulSoup(url_input, "html.parser")

    ### Title ###
    # Fallback container searched for the title if the first try fails
    title_exc = bs_parsed.find("div", {"itemprop": "articleBody"})
    try:
        title = bs_parsed.find_all("h1", {"class": "entry-title"})[0].text.encode('ascii', 'ignore')
    except (AttributeError, IndexError):
        try:
            title = title_exc.find_all("strong")[0].text.encode('ascii', 'ignore')
        except (AttributeError, IndexError):
            title = np.NaN

    ### Text ###
    # Container holding the body of the entry
    text_body = bs_parsed.find_all("div", {"class": "entry-content"})[0]
    # Get paragraphs, drop the final (author) line, join as one body of text, and strip non-ASCII
    ind_paras = [span.text for span in text_body.find_all("span", {"style": "font-weight: 400;"})]
    paras_all = "".join(ind_paras[:-1]).strip().encode('ascii', 'ignore')

    ### Author ###
    # Retrieve the part of articleBody in which the author line is contained
    itemprop_body = str([item.text for item in bs_parsed.find_all("div") if item.get("itemprop") == "articleBody"])
    try:
        author = re.split("Written [Bb]y: ", itemprop_body)[1]
        author = re.split("\\\\", author.encode('ascii', 'ignore'))[0]
    except IndexError:
        # Explicitly search for a first and last name if the split fails
        try:
            author_search = re.search("Written [Bb]y: [A-Za-z]{1,20} [A-Za-z]{2,20}", bs_parsed.text)
            author = re.split("Written [Bb]y: ", author_search.group(0))[1]
        except AttributeError:
            author = np.NaN

    ### Dictionary ###
    return {"author": author, "text": paras_all, "title": title, "url": url}
article_cont(url_test)
Exercise 1.3. Use your functions from exercises 1.1 and 1.2 to get a data frame of 60 Campus News articles and a data frame of 60 City News articles. Add a column to each that indicates the category, then combine them into one big data frame.
The "text" column of this data frame will be your corpus for natural language processing in exercise 1.4.
# Campus and city articles
campus_pages = mult_pages("https://theaggie.org/campus", pages = 4)
city_pages = mult_pages("https://theaggie.org/city", pages = 4)
combined_pages = campus_pages + city_pages
all_cont = [article_cont(l) for l in combined_pages]
aggie_df = pd.DataFrame(all_cont)
# Create source column
cat = np.array(["campus", "city"])
aggie_df["source"] = np.repeat(cat, [60, 60], axis=0)
aggie_df
Exercise 1.4. Use the Aggie corpus to answer the following questions. Use plots to support your analysis.
What topics does the Aggie cover the most? Do city articles typically cover different topics than campus articles?
What are the titles of the top 3 pairs of most similar articles? Examine each pair of articles. What words do they have in common?
Do you think this corpus is representative of the Aggie? Why or why not? What kinds of inference can this corpus support? Explain your reasoning.
Hints:
The nltk book and scikit-learn documentation may be helpful here.
You can determine whether city articles are "near" campus articles from the similarity matrix or with k-nearest neighbors.
If you want, you can use the wordcloud package to plot a word cloud. To install the package, run
conda install -c https://conda.anaconda.org/amueller wordcloud
in a terminal. Word clouds look nice and are easy to read, but are less precise than bar plots.
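The k-nearest neighbors hint can be sketched on a toy corpus. The `docs` and `labels` below are illustrative stand-ins for the scraped article texts and their source column, not the real corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy stand-ins for the article texts and their "source" labels
docs = ["campus senate vote", "campus chancellor news",
        "city council budget", "city budget vote tax"]
labels = ["campus", "campus", "city", "city"]

tfs = TfidfVectorizer().fit_transform(docs)

# Cosine distance, so document length does not dominate the comparison
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(tfs)
dist, idx = nn.kneighbors(tfs)

# idx[i, 0] is the document itself; idx[i, 1] is its nearest other document.
# Counting how often a city article's nearest neighbor is a campus article
# (and vice versa) indicates whether the two categories mix.
nearest_labels = [labels[j] for j in idx[:, 1]]
```

On the real data frame, the same pattern applied to the 120-article tf-idf matrix and the `source` column would show whether city articles are "near" campus articles.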
What topics does the Aggie cover the most? Do city articles typically cover different topics than campus articles?
Aggie topics in general:
Analysis: Below are word clouds of the most common terms in the titles, and then in the text bodies, of all articles in the data frame. Recent events such as the protests are common in the titles, along with other political keywords and news about the new chancellor, since titles are meant to grab attention and succinctly describe an article's content. In the text bodies, by contrast, verbs and adverbs are more common: words that describe rather than attract immediate attention. The word frequencies for the text bodies are also much more heavily skewed toward the low end because, as expected, the text draws on a far larger vocabulary than the titles.
# Word cloud for all 120 articles:
categories = ["title", "text"]
# Extra stop words; 'UC' and 'Davis' dominate otherwise
stopwords2 = set(["UC", "Davis", "new", "news", "police", "logs", "might", "also", "come", "don't", "student", "said", "will", "Yolo", "really", "going", "day", "students", "year", "city", "campus", "last", "week"])
stopwords = set(STOPWORDS).union(stopwords2)

# Gray color function shared by both clouds:
def col_gray(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(50, 0%%, %d%%)" % random.randint(10, 50)

for cat in categories:
    rel_text = (" ".join(aggie_df[cat])).split(" ")
    terms = " ".join([term for term in rel_text if term not in stopwords])
    # Word cloud image:
    wc = WordCloud(background_color="white", max_words=1000, stopwords=stopwords, width=800, height=400)
    wc.generate(terms)
    print(cat)
    # Plot:
    plt.figure(figsize=(20, 10))
    plt.imshow(wc.recolor(color_func=col_gray, random_state=3))
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
Campus vs. city articles:
Analysis: I analyze titles rather than full text to determine whether campus and city articles cover different topics, because the title is a better measure of the "key words" that describe an article's topic. Campus articles appear to cover more political topics, be it national politics or ASUCD politics (senate, etc.); we see topics involving the new chancellor, sustainability, recent protests, and campus services. City articles, on the other hand, involve community topics such as food, public events, art, residential issues, and local campaigns. Political issues seem to overlap between the two, as do local and state events, which makes intuitive sense.
# Word cloud for campus and city titles:
sources = ["campus", "city"]
stopwords2 = set(["UC", "Davis", "new", "news", "police", "logs", "might", "also", "come", "don't", "student", "said", "will", "Yolo", "really", "going", "day", "students", "year", "city", "campus", "last", "week"])
stopwords = set(STOPWORDS).union(stopwords2)

# Gray color function shared by both clouds:
def col_gray(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(50, 0%%, %d%%)" % random.randint(10, 50)

for source in sources:
    rel_text = (" ".join(aggie_df.loc[aggie_df['source'] == source]["title"])).split(" ")
    terms = " ".join([term for term in rel_text if term not in stopwords])
    # Word cloud image:
    wc = WordCloud(background_color="white", max_words=1000, stopwords=stopwords, width=800, height=400)
    wc.generate(terms)
    print(source)
    # Plot:
    plt.figure(figsize=(20, 10))
    plt.imshow(wc.recolor(color_func=col_gray, random_state=3))
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
Barplot of most common words in text:
Filtering out stop words, we see that the most common words are directly community and Davis-related, as well as generally common words one would find in any body of writing.
stopwords2 = set(["UC", "Davis", "new", "news", "police", "logs", "might", "also", "come", "don't", "student", "said", "will", "Yolo", "really", "going", "day", "students", "year", "city", "campus", "last", "week"])
stopwords = set(STOPWORDS).union(stopwords2)  # 'UC' and 'Davis' do not filter otherwise
# Flatten all article text into one list of words, then drop stop words
w0 = [re.findall(r'\w+', aggie_df['text'][x]) for x in range(len(aggie_df))]
w1 = [item for sublist in w0 for item in sublist]
wordz = [word for word in w1 if word.lower() not in stopwords]
# Top 20 terms by frequency
term_freq = Counter(wordz).most_common(20)
termfreq_df = pd.DataFrame(term_freq, columns=['term', 'freq'])
indexes = np.arange(len(termfreq_df))
plt.bar(indexes, termfreq_df['freq'], align='center', alpha=0.5)
plt.xticks(indexes, termfreq_df['term'], rotation=75, fontsize=9)
plt.ylabel('Frequency of term')
plt.title('Most Common Terms in the Text of Articles Scraped')
plt.show()
What are the titles of the top 3 pairs of most similar articles? Examine each pair of articles. What words do they have in common?
Analysis: Here, a sparse matrix is used, where each similarity score is the unnormalized dot product of two articles' tf-idf vectors, so it depends on the words a pair of articles share and the frequency of those words. This method is seriously flawed because article length directly inflates the score: there is no weighting to account for the number of words in each article. As a result, the similarity score between two different articles can sometimes exceed the similarity score between one of those articles and itself, depending on their relative lengths and how much they have in common. Despite this, the top three pairs by this measure are shown, and 20 terms shared within each pair are displayed in a data frame below. The following code computes the similarity scores between all combinations of articles, then filters out the entries where an article's best match is itself.
# From Lesson 11:
stemmer = PorterStemmer().stem
tokenize = nltk.word_tokenize

def stem(tokens, stemmer=PorterStemmer().stem):
    return [stemmer(w.lower()) for w in tokens]

def lemmatize(text):
    return stem(tokenize(text))

def that_one_time_i_made_a_coo(param):
    # Sort the (row, col, value) triples by row, then by descending similarity
    tuples = izip(param.row, param.col, param.data)
    return sorted(tuples, key=lambda x: (x[0], x[2]), reverse=True)

vectorizer = TfidfVectorizer(tokenizer=lemmatize, stop_words="english", smooth_idf=True, norm=None)
tfs = vectorizer.fit_transform(aggie_df["text"])
sim = tfs.dot(tfs.T)
sim_mx = csr_matrix(sim)
y = sim_mx.tocoo()
order_mx = that_one_time_i_made_a_coo(y)
# Take each article's highest-scoring match (every 120th sorted entry)...
same_art = order_mx[0::120]
# ...and keep only the matches that are not the article itself
most_sim = [entry for entry in same_art if entry[0] != entry[1]]
sim_df = sorted(most_sim, key=lambda x: x[2], reverse=True)
sim_df
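The length bias described above can be avoided by length-normalizing the tf-idf rows, so the dot product becomes a true cosine similarity. A minimal sketch on toy documents (not the assignment's corpus; the second toy document is just the first repeated, i.e. twice as long):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy documents; doc 1 is doc 0 repeated, i.e. same content, double the length
docs = ["budget city council",
        "budget city council budget city council",
        "campus protest"]

# With norm='l2' (the TfidfVectorizer default) every row has unit length,
# so the dot product is a cosine similarity bounded by 1.0, and each
# article is maximally similar to itself
tfs = TfidfVectorizer(norm="l2").fit_transform(docs)
sim = linear_kernel(tfs, tfs)
# sim[0, 1] == 1.0 despite the length difference; the diagonal is all 1.0
```

Under this normalization a longer article can never out-score an article's similarity with itself, which removes the anomaly noted above.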
The top 3 most similar articles are as follows:
print aggie_df["title"][35], "\n", aggie_df["title"][14] # most similar
print aggie_df["title"][27], "\n", aggie_df["title"][16] # second most similar
print aggie_df["title"][119], "\n", aggie_df["title"][16] # third most similar
Analysis: The words each pair of articles has in common are as follows. In the first pair, as expected, words involving mental health, collaboration, and personal terminology are very common, which agrees with what the two titles suggest about the content. In the second and third pairs we see more community and political terminology, again expected given the titles in both pairs. For specifics, see below:
mostsim1 = set.intersection(set(aggie_df["text"][35].split(" ")), set(aggie_df["text"][14].split(" ")))
mostsim2 = set.intersection(set(aggie_df["text"][27].split(" ")), set(aggie_df["text"][16].split(" ")))
mostsim3 = set.intersection(set(aggie_df["text"][119].split(" ")), set(aggie_df["text"][16].split(" ")))
sim_list = [mostsim1, mostsim2, mostsim3]
sim_heads = [pd.DataFrame([s]).T.head(20) for s in sim_list]
sim_df = pd.concat([sim_heads[0], sim_heads[1], sim_heads[2]], axis=1)
sim_df.columns = ['Pair 1', 'Pair 2', 'Pair 3']
sim_df
Do you think this corpus is representative of the Aggie? Why or why not? What kinds of inference can this corpus support? Explain your reasoning.
Analysis: This corpus is not representative of the Aggie. Our sample is neither random nor sufficiently large (compared to the body of articles that exists in archived form online) to support meaningful inference about the articles the Aggie has published. First, the Aggie publishes in many subcategories that we have not considered (beyond campus, city, arts, etc.), so drawing inferences about the whole body of work from a small subset of the categories available is extremely short-sighted. Second, to make inferences about all articles, we would need to sample randomly from everything the Aggie has ever published (strictly speaking, including work not published online); this leads to my last point, that the sample is a very small subset of the larger set. The combination of these factors makes this sample of 120 articles (60 campus, 60 city) an extremely poor representation of the Aggie's output. Additionally, because the sampling covers only the past couple of months, the terms are heavily biased toward the nouns that have been in recent news cycles rather than toward the Aggie's full publication history.
This corpus can support inference within the time frame of the past few months (the range of the articles we scraped) and only in the campus and city subcategories. (Fun fact: Despite the great amount of work that I put into this assignment, I would not trust anything I did here to make any meaningful inference at all, as much of this is a crude representation of what is actually true. I want you to know this as I enter the next stage of my life, where I could start making decisions that affect people's lives.)